Objective

In the first chapter of Data Science For Public Policy, I learned how to work with spatial census data in R. Through the excercises, we queried acs census tract level demographic and economic data from philidelphia and paired it with philidelphia’s public transit system spatial data to see if people were willing to pay a premium to live in transit oriented development.

For my own personal analysis, I want observe the relationship between bike lanes and various demograpic statistics in Sacramento California, where I live. Similar to the analysis in the textbook. I don’t think I’ll confidently be able to infer any causation. But, I believe that I will find that areas with more bike lanes will have a higher percentage of working age young adults living nearby, and possibly an increasing number of working age young adults living nearby over time. I also plant to look at statistics like average rent, percent bachelors degree and more.

I also want to see the relationship of home values, rents, and crime with the prevalence of affordable housing in a neighborhood. One of the most typical arguments of a NIMBY homeowner is that having affordable housing will bring down their home value. Another argument is that it will bring criminals and therefore crime into their neighborhoods. I imagine that there may potentially be some truth to the first point, though very marginal. I have greater doubts about the second point. Let’s see! to fully do this analysis, I will likely need to utilize a dif in dif test. I imagine we will go over this in later chapters.

Working with data outside the scope of the textbook is a learning experience in and of itself. below I am going to do my best to document everything that I’ve learned and the problems I had to solve.

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidycensus)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ggplot2)

options(scipen=999) #says no scientific notation
options(tigris_class = "sf") # tells tidycensus to download in sf layers
#hopefully this will be all the packages I need
census_api_key("c08e727b3d32b823955aea424744c9dda3b30df0", install = TRUE, overwrite = TRUE)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "c08e727b3d32b823955aea424744c9dda3b30df0"
# readRenviron("~/.Renviron") is for some reason needed for the api to be active in the current session 
readRenviron("~/.Renviron")
#making sure that my api key for the census is working

working with the census data api is not super intuitive!

It’s been frustrating trying to figure out the best way to understand what each variable code name really stands for. there is no website it seems that lays out in an intuitive key for the variable code names. Additionally, I could not find a simple way to find comprehensive metadata for each dataset.

The best way to find the right variables seems to be to use the load_variables function. for the dataset argument, you have to do your best to guess the correct option. for some it’s easier than others. For example, it’s intuitively obvious that the "acs1" option corresponds to the 1 year ACS. However, I’d have to do some internet digging to figure out what the hell the "cd110h" dataset is.

Anyway, once you’ve loaded your variables, it seems the best way to find the exact variable you’re looking for is by using text search within the table. So, for example, if you want to use the American Community Survey 1 year estimates from 2021, you would use the code below to load the full detailed list of variables into the viewing pane in the top left.

view(load_variables(2021, "acs1"))

now that we have this table loaded, we notice that we have three columns: name, label, and concept.

Name is the variable code name corresponding to the variable column that is returned be returned in the get_acs function.

Concept I believe represents the census “table” that the variable described in the label column belongs to. When you go to data.census.gov, you will find a variety of “tables” containing variables of particular categories. For example, the table called “AGE AND SEX” contains statistics on, you guessed it, age and sex. the Concept column corresponds to these table names.

Certain statistics repeat across tables. For example, a lot of tables include a total population row. For comparison purposes, you probably want to see the total population of an area when you’re looking at a table of AGE BY SEX as well something like employment. So, it makes sense for total population to repeat. When we load_variables, unless we specify a particular table as an argument, we load multiple tables. And, because some statistics repeat across tables, we will have multiple rows with the same statistic. Confusingly, Each table is made up by variables with unique codenames. I checked to make sure this was true by loading multiple instances of statistics that appeared to represent the same thing. below is a get_acs call to return the median rent of the total population (Estimate!!Median gross rent –!!Total:) in Sacramento County from both the MEDIAN GROSS RENT BY BEDROOMS concept/table as well as the median rent of the total population (Estimate!!Median gross rent –!!Total:) from the MEDIAN GROSS RENT BY YEAR STRUCTURE BUILT concept/table. As you will see below, these two variables return the same exact value.

print(
    get_acs(geography = "county",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2021, survey = "acs1",
            variables = c("B25031_001", "B25111_001")
           )
     )
## Getting data from the 2021 1-year ACS
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
## Simple feature collection with 2 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8625 ymin: 38.01842 xmax: -121.0271 ymax: 38.7364
## Geodetic CRS:  NAD83
##   GEOID                          NAME   variable estimate moe
## 1 06067 Sacramento County, California B25031_001     1512  27
## 2 06067 Sacramento County, California B25111_001     1512  27
##                         geometry
## 1 MULTIPOLYGON (((-121.8625 3...
## 2 MULTIPOLYGON (((-121.8625 3...

Now let’s check a different variable. I am going to check the total population variable since it appears so often across tables. In one instance I see that there is this variable: Name:B01003_001 Label: Estimate!!Total concept: TOTAL POPULATION I am quite confident that this represents the estimate for the total population. the concept/table being called TOTAL POPULATION makes me quite confident

I also see a variable B25008_001 Estimate!!Total: TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE

I’m going to guess that despite having the same label and similar concepts that these are different statistics with two different values. This is because not everyone in an area will be in “occupied housing”, whatever that means. I’m guessing that just means not homeless. lets see what we get

print(
    get_acs(geography = "county",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2021, survey = "acs1",
            variables = c("B01003_001", "B25008_001")
           )
     )
## Getting data from the 2021 1-year ACS
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Simple feature collection with 2 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8625 ymin: 38.01842 xmax: -121.0271 ymax: 38.7364
## Geodetic CRS:  NAD83
##   GEOID                          NAME   variable estimate  moe
## 1 06067 Sacramento County, California B01003_001  1588921   NA
## 2 06067 Sacramento County, California B25008_001  1563704 5415
##                         geometry
## 1 MULTIPOLYGON (((-121.8625 3...
## 2 MULTIPOLYGON (((-121.8625 3...

Well, that’s about what I expected. we get an estimate of 1.58 mil for the total population and 1.56 mil for the total population in occupied housing.

What about this variable?: B01001_001 Estimate!!Total: SEX BY AGE

We can infer that this variable is from the SEX BY AGE table. We can infer that its an estimate of a total of something. What is that something? Sexes? unlikely, I think that this is also a variable representing the total population. I am guessing this because if you were accessing the SEX BY AGE data on the census website, the table would likely give a row with the total population for the user to use as a comparison tool. let’s test my hypothesis by querying it alongside the total population variable for comparison.

print(
    get_acs(geography = "county",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2021, survey = "acs1",
            variables = c("B01003_001", "B01001_001")
           )
     )
## Getting data from the 2021 1-year ACS
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Simple feature collection with 2 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8625 ymin: 38.01842 xmax: -121.0271 ymax: 38.7364
## Geodetic CRS:  NAD83
##   GEOID                          NAME   variable estimate moe
## 1 06067 Sacramento County, California B01001_001  1588921  NA
## 2 06067 Sacramento County, California B01003_001  1588921  NA
##                         geometry
## 1 MULTIPOLYGON (((-121.8625 3...
## 2 MULTIPOLYGON (((-121.8625 3...

Just as I suspected, these are the same variable. So, why was I confident that Estimate!!Total: from SEX BY AGE would be the total population while Estimate!!Total: from TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE would not? What made me confident was the fact that there was no other rows in the occupied housing table that could have conceivably represent any sort of total population. I relied on subtle context clues. Why wouldn’t someone interested in occupied housing want to compare the total population of those in occupied housing to the total population in any housing situation? No idea. I feel like that a total overall population would fit naturally into the occupied housing table. To my point…it’s not very intuitive.

other challenges

Which datasets can I use?

I am realizing from the calls above the one year ACS data from 2021 that for that particular dataset, it can only produce data for geographies 65,000 people and larger. It gives that warning after the query, I imagine that’s true for all of 1 year ACS datasets. Let’s check.

print(
    get_acs(geography = "county",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2019, survey = "acs1",
            variables = c("B01003_001", "B01001_001")
           )
     )
## Getting data from the 2019 1-year ACS
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
## Simple feature collection with 2 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8625 ymin: 38.01842 xmax: -121.0271 ymax: 38.7364
## Geodetic CRS:  NAD83
##   GEOID                          NAME   variable estimate moe
## 1 06067 Sacramento County, California B01001_001  1552058  NA
## 2 06067 Sacramento County, California B01003_001  1552058  NA
##                         geometry
## 1 MULTIPOLYGON (((-121.8625 3...
## 2 MULTIPOLYGON (((-121.8625 3...
print(
    get_acs(geography = "county",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2015, survey = "acs1",
            variables = c("B01003_001", "B01001_001")
           )
     )
## Getting data from the 2015 1-year ACS
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
## Simple feature collection with 2 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8625 ymin: 38.01842 xmax: -121.0271 ymax: 38.7364
## Geodetic CRS:  NAD83
##   GEOID                          NAME   variable estimate moe
## 1 06067 Sacramento County, California B01001_001  1501335  NA
## 2 06067 Sacramento County, California B01003_001  1501335  NA
##                         geometry
## 1 MULTIPOLYGON (((-121.8625 3...
## 2 MULTIPOLYGON (((-121.8625 3...

Yep, all the 1 year data can only be applied to geographies with at least 65,000 people

The ACS data is released on a rolling basis, and each year’s 5-year estimates represent data collected over a 5-year period. for example, if we set the year argument in a get_acs query to 2021, that data that it populate will be from surveys taken in the years 2017-2021.

Since I have been using the ACS 1 year estimates data for the examples above, I am going to look through the metadata from the 5 year ACS.

view(load_variables(year = 2021, dataset = "acs5"))

Seems like the variables have the same names as the 1 year 2021 census

filtering for tracts only in the city of sacramento

I want to just analyze Sacramento city, but I am not sure if that is an available geography in the ACS. Let’s check by doing a get_acs call in which we set the geography = "city"

Yeah, that didn’t work. I think what we are going to have to do is download the whole county data and then filter out the area outside of the city.

I’m think that the best way to do this will be to download a map of the city, plot it as an sf layer with the tract level county data, see how the borders overlap, and then decide on the best way to filter the tracts.

I found a map of the cities that make up sacramento county on this website

The data seems queriable. I’m going to refer back to the textbook for how to correctly import this map. Note that I am transforming the CRS from a global geodetic to local projected. The textbook doesn’t do a good job of explaining CRS and its importance. It can sort of be thought of like units of measurement. Some are more appropriate than other given certain situations. you measure human weight in pounds and magic mushrooms in grams. You could do it the other way around, but that wouldn’t be intuitive and there may be more room for error. We start with a map of a portion of the oblate spheroid globe as reference points and measure distances in global degrees, and we transform it to a flat projected map that uses feet as distances: much more appropriate.

sac_county_borders <-
        st_read("https://services1.arcgis.com/5NARefyPVtAeuJPU/arcgis/rest/services/CityBoundarieswithUnincorporated/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>%
    st_transform('EPSG:6418')
## Reading layer `OGRGeoJSON' from data source 
##   `https://services1.arcgis.com/5NARefyPVtAeuJPU/arcgis/rest/services/CityBoundarieswithUnincorporated/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 8 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -121.8632 ymin: 38.01815 xmax: -121.0271 ymax: 38.73595
## Geodetic CRS:  WGS 84
head(sac_county_borders)
## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 6671040 ymin: 1819000 xmax: 6824664 ymax: 2025806
## Projected CRS: NAD83(2011) / California zone 2 (ftUS)
##   OBJECTID               DISTRICT      CITY_NAME ID Shape__Area Shape__Length
## 1        1 CITY OF CITRUS HEIGHTS CITRUS HEIGHTS  1   396452761      97224.78
## 2        2        CITY OF ISLETON        ISLETON  2    12736535      26552.23
## 3        3 CITY OF RANCHO CORDOVA RANCHO CORDOVA  3   970702150     214876.84
## 4        4         CITY OF FOLSOM         FOLSOM  4   839554877     153436.00
## 5        5      CITY OF ELK GROVE      ELK GROVE  5  1190697956     219186.96
## 6        6           CITY OF GALT           GALT  7   217961058      99574.52
##                         geometry
## 1 MULTIPOLYGON (((6777373 202...
## 2 MULTIPOLYGON (((6677594 182...
## 3 MULTIPOLYGON (((6777261 196...
## 4 MULTIPOLYGON (((6813005 202...
## 5 MULTIPOLYGON (((6768633 192...
## 6 MULTIPOLYGON (((6768285 186...
print(st_crs(sac_county_borders))
## Coordinate Reference System:
##   User input: EPSG:6418 
##   wkt:
## PROJCRS["NAD83(2011) / California zone 2 (ftUS)",
##     BASEGEOGCRS["NAD83(2011)",
##         DATUM["NAD83 (National Spatial Reference System 2011)",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",6318]],
##     CONVERSION["SPCS83 California zone 2 (US Survey feet)",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",37.6666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-122,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",39.8333333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",38.3333333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",6561666.667,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",1640416.667,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United States (USA) - California - counties of Alpine; Amador; Butte; Colusa; El Dorado; Glenn; Lake; Mendocino; Napa; Nevada; Placer; Sacramento; Sierra; Solano; Sonoma; Sutter; Yolo; Yuba."],
##         BBOX[38.02,-124.06,40.16,-119.54]],
##     ID["EPSG",6418]]

Yay! we successfully queried the geoJSON data like in the textbook and we get a sf layer

Let’s try plotting it without transforming it in any way.

ggplot() +
    geom_sf(data = sac_county_borders) 

Well would you look at that. That looks like Sacramento county if I ever saw it! Now lets get just the city

First, I want to take a look at my table of variables to understand what I need to select

view(sac_county_borders)

seems like Sacramento city is the 8th row, so the following code should create a new sf table with one row representing the city of Sacramento

sac_city_border <-
    sac_county_borders[8,]
view(sac_city_border)

seems to have worked. now lets plot it.

ggplot() +
    geom_sf(data = sac_city_border) 

I’m a bit confused by that little chunk that missing inside of the city. I’m going to guess that that is unincorporated?? let’s check

ggplot() +
    geom_sf(data = sac_county_borders[7,]) 

yep, that little chunk appears to be unincorporated. interesting. Wonder how that happened.

now lets try layering the city on top of census tract level data of the county.

I’m first going to query the 5 year tract level census data and then view and plot it to see what it looks like.

sac_county_acs <- 
     get_acs(geography = "tract",
            state = "CA",
            county = "Sacramento",
            geometry = TRUE,
            year = 2021, survey = "acs5",
            variables = "B01003_001"
           ) %>%
    st_transform(st_crs(sac_county_borders)) #note that we are matching CRSs 
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%
head(sac_county_acs)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 6711842 ymin: 1905913 xmax: 6763745 ymax: 2020192
## Projected CRS: NAD83(2011) / California zone 2 (ftUS)
##         GEOID                                              NAME   variable
## 1 06067008119 Census Tract 81.19, Sacramento County, California B01003_001
## 2 06067006102 Census Tract 61.02, Sacramento County, California B01003_001
## 3 06067009651 Census Tract 96.51, Sacramento County, California B01003_001
## 4 06067009644 Census Tract 96.44, Sacramento County, California B01003_001
## 5 06067008129 Census Tract 81.29, Sacramento County, California B01003_001
## 6 06067009610 Census Tract 96.10, Sacramento County, California B01003_001
##   estimate  moe                       geometry
## 1     6915  848 MULTIPOLYGON (((6758527 200...
## 2     3914  508 MULTIPOLYGON (((6732697 198...
## 3     5516  609 MULTIPOLYGON (((6728531 190...
## 4     3312  542 MULTIPOLYGON (((6711842 191...
## 5     2796  551 MULTIPOLYGON (((6753713 201...
## 6     7101 1093 MULTIPOLYGON (((6720102 193...
ggplot() +
    geom_sf(data = sac_county_acs)

That all looks like how we want it to look.

Now, I want to layer my city outline on top of my census tract map to heuristically see if the city tracts fall perfectly into the city borders. In other words, I want to see if the borders cut through each other, in which case I will have a new challenge filtering for the tracts I want.

ggplot() +
    geom_sf(data = sac_county_acs) +
    geom_sf(data = sac_city_border, aes(color = "red", alpha = .5)) 

Annoyingly, we learn that the census plots do not indeed fall neatly into the city boundaries. It seems like we are going to need to selct which tracts count as as part of the city and which do not. I think I will use the select by centroid approach used in the book.

But first, I want to investigate the purpose of the CRSs and see how they may be messing up what I am doing here.

I found a document called “overview of CRS in R which says:”In R, when data with different CRS are combined it is important to transform them to a common CRS so they align with one another. This is similar to making sure that units are the same when measuring volume or distances.”

After a somewhat significant amount of reading, I think I have a decent grasp of how a coordinate reference system works. It’s a framework for places geographic points relative to each other. At the largest geographic level a CRS can encompass the entire globe and calculate distances and geographies along the oblate spheroid that is our earth. It seems that a CRS that encompasses the entire earch can feasibly work for any region in the world, however, that comes at the cost of accuracy when we are looking at data in small geographic areas. These large area CRSs also use units of measurement like degrees which arent suitable for precise areas. CRSs are easily convertible. For My case I want to take my data which seems to be mapped using a geodetic system and convert it to a projected system that is more accurate to the Sacramento area.

Since it seems clear that the tracts and the city borders don’t completely match, We have to choose which tracts to keep. I’m going to use the centroid method from the textbook. So, ultimately, the tracts that I will be left with will be only those whose geographic center of gravity fall within the cith of Sacramento.

sac_centroid_select <-
    st_centroid(sac_county_acs)[sac_city_border,] %>%
        st_drop_geometry() %>%
        left_join(select(sac_county_acs, GEOID)) %>%
        st_sf()
## Warning: st_centroid assumes attributes are constant over geometries
## Joining with `by = join_by(GEOID)`
head(sac_centroid_select)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 6691921 ymin: 1925751 xmax: 6731527 ymax: 2000728
## Projected CRS: NAD83(2011) / California zone 2 (ftUS)
##         GEOID                                              NAME   variable
## 1 06067009610 Census Tract 96.10, Sacramento County, California B01003_001
## 2 06067004017 Census Tract 40.17, Sacramento County, California B01003_001
## 3 06067002100    Census Tract 21, Sacramento County, California B01003_001
## 4 06067001602 Census Tract 16.02, Sacramento County, California B01003_001
## 5 06067006400    Census Tract 64, Sacramento County, California B01003_001
## 6 06067001601 Census Tract 16.01, Sacramento County, California B01003_001
##   estimate  moe                       geometry
## 1     7101 1093 MULTIPOLYGON (((6720102 193...
## 2     2648  360 MULTIPOLYGON (((6691921 193...
## 3     2313  408 MULTIPOLYGON (((6699199 196...
## 4     2401  467 MULTIPOLYGON (((6718420 196...
## 5     5458  994 MULTIPOLYGON (((6724549 199...
## 6     3595  667 MULTIPOLYGON (((6720702 196...
ggplot() +
    geom_sf(data = sac_centroid_select) +
    geom_sf(data = sac_city_border, aes(color = "red", alpha = .5))

That looks pretty good if I do say so myself.

Getting and cleaning bikelane data

Now we import the bike lanes.

 sac_bikelanes<-
    st_read("https://services5.arcgis.com/54falWtcpty3V47Z/arcgis/rest/services/Bike_Master_Plan_Facilities/FeatureServer/1/query?outFields=*&where=1%3D1&f=geojson") %>%
    st_transform(st_crs(sac_centroid_select))
## Reading layer `OGRGeoJSON' from data source 
##   `https://services5.arcgis.com/54falWtcpty3V47Z/arcgis/rest/services/Bike_Master_Plan_Facilities/FeatureServer/1/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 4937 features and 6 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -121.5607 ymin: 38.43798 xmax: -121.3698 ymax: 38.68551
## Geodetic CRS:  WGS 84
head(sac_bikelanes)
## Simple feature collection with 6 features and 6 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 6705411 ymin: 1965711 xmax: 6722421 ymax: 1984850
## Projected CRS: NAD83(2011) / California zone 2 (ftUS)
##   OBJECTID    FULLSTREET UNIQUE_ID LENGTHMILES             CL_TYPE
## 1        3        9TH ST     22186        0.04  Class 2: Bike Lane
## 2        6       18TH ST     23873        0.04  Class 2: Bike Lane
## 3        7          T ST     60587        0.14  Class 2: Bike Lane
## 4        9          H ST     23378        0.13  Class 2: Bike Lane
## 5       12 EL CAMINO AVE     28871        0.06  Class 2: Bike Lane
## 6       15   WOODLAKE DR     18791        0.04 Class 3: Bike Route
##   Shape__Length                       geometry
## 1      194.4956 LINESTRING (6705473 1971875...
## 2      216.0246 LINESTRING (6708248 1968504...
## 3      744.3018 LINESTRING (6715692 1965953...
## 4      663.5481 LINESTRING (6721791 1969510...
## 5      337.3181 LINESTRING (6716145 1984842...
## 6      237.5436 LINESTRING (6718598 1982625...
ggplot() +
    geom_sf(data = sac_bikelanes, color = "purple") +
    geom_sf(data = sac_city_border, color = "red", alpha = .5)

ggplot() +
    geom_sf(data = sac_bikelanes, color = "purple") +
    geom_sf(data = sac_centroid_select, color = "red", alpha = .5)

the way that plotted looks kind of funky but I’m not gonna deal wit that just yet. Already, I notice that some of the bike lanes extend beyond the city borders, those will end up being removed from our analysis. looking at the head of the data, I notice that I have one column titled Shape__Length without a unit of measurement. I’m guessing it’s feet but we should be certain. we may need that information when trying to determine density of bike lanes by census tract. It says in the metadata on the city of Sacramento website that the unit of measurement is feet. We also have the length in miles measure which makes things clearer.

Something that I noticed is that it seems that there are a fuck ton of duplicated rows in this data set, which was really confusing me.

duplicate_lanes <- sac_bikelanes %>%
    filter(
        duplicated(select(., OBJECTID:Shape__Length)) | # the or operator `|` essentially work to return both selections without repeating those rows that are selected by both lines of code in the duplicated functions
        duplicated(select(.,OBJECTID:Shape__Length), fromLast = TRUE) #this row checks for duplicates starting from the bottom of the dataset. without doing this the rows that the previous line of code identifies as having duplicates later in the dataset would not be counted.
          )

dlane <- sac_bikelanes %>% #dlane is a list of all the rows in the dataset that have the same values as a previous row
    filter(
        duplicated(select(., OBJECTID:Shape__Length)))

I don’t know why, but the data repeats a whole bunch. duplicate_lanes is a dataframe that in all rows that appear more than once. dlane is a datafram that includes all rows that are a repeat of a previous row. since the number o rows in dlane is exactly half the number of rows in

time to get the unique values

unique_sac_bikelanes <- anti_join(sac_bikelanes, st_drop_geometry(dlane)) %>% #this line of code removes all rows that have a duplicate, so we need to add back dlane so that we have all the bike lanes
    rbind(dlane)
## Joining with `by = join_by(OBJECTID, FULLSTREET, UNIQUE_ID, LENGTHMILES,
## CL_TYPE, Shape__Length)`

now I want to see if the two plots of the orginal dataset and the unique rows dataset look about the same

ggplot() +
    geom_sf(data = sac_bikelanes, color = "purple") +
    geom_sf(data = sac_city_border, color = "red", alpha = .5)

ggplot() +
    geom_sf(data = unique_sac_bikelanes, color = "purple") +
    geom_sf(data = sac_city_border, color = "red", alpha = .5)

looks the same to me

The next challenge will be determining the density of bike lanes in each tract. Since the census tracts certainly dissect some of the bike lanes, we are going to have to find a way to find a numerical distance for the dissected portion of each bike lane.the dataset gives us a whole length for each total bike lane, but that wont be useful spatial selection tools don’t change other rows. In other words, each dissected row returned from a spatial selection will have the total length of the bikelane segment, not just the dissected protion.

we can dissect each bike lane by census tract using the st_intersection function

bikelane_intersection <- st_intersection(unique_sac_bikelanes, sac_centroid_select)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

st_intersection deletes any parts of the bikelanes that do not fall into the multipolygons in our y value. I have been getting confused as to why the difference between our value for unique_sac_bikelanes and bikelane_intersection is not very large while disssceted_lanes which (next code chunk) represents all the bike lanes that cross between 2 or more polygons is so much larger. I think the reason for that is because I unique_sac_bikelanes includes lanes outside of our census tracts.

in_tracts_bikelanes <-
    st_intersection(unique_sac_bikelanes,st_union(sac_centroid_select))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
nrow(bikelane_intersection) - nrow(in_tracts_bikelanes)
## [1] 441

The difference between the number of bikelane rows inside our acs tract borders of interest and the number of biklane rows in our acs border after being dissected by tract is 441. We’d then expect that the number of rows with repeated data (besides their new unique dissected geometry) to be 441.

dissect_repeat <- st_drop_geometry(bikelane_intersection) %>%
    filter(
        duplicated(select(., OBJECTID:Shape__Length)) 
          )
nrow(dissect_repeat)
## [1] 441

great! thants what we expected.

Now I want to plot my dissected bikelanes to make sure everything looks right

dissected_lanes <- st_drop_geometry(bikelane_intersection) %>%
    filter(
        duplicated(select(., OBJECTID:Shape__Length)) |
        duplicated(select(.,OBJECTID:Shape__Length), fromLast = TRUE)
          ) %>%
    left_join(select(bikelane_intersection, OBJECTID:GEOID)) %>%
        st_sf()
## Joining with `by = join_by(OBJECTID, FULLSTREET, UNIQUE_ID, LENGTHMILES,
## CL_TYPE, Shape__Length, GEOID)`
ggplot() +
    geom_sf(data = dissected_lanes, color = "purple") +
    geom_sf(data = sac_centroid_select, color = "red", alpha = .5)

that seems to make sense.

###creating a new variable: biklane density per plot

next step. I want to calculate the density of bikelanes within each tract. Right now, within my bikelane intersection data, I do not have a column representing the true length of each segment

bikelane_intersection <- bikelane_intersection %>%
    mutate(roadlenght = st_length(.))
view(bikelane_intersection)

this looks right. when we view our updated bikelane_intersection, the new roadlength column even outputs values that clarify that the unit of measurement as US survey foot. the values in this column are part of the units class. This means that when making calculations with multiple measurements it does automatic unit conversion.

now I am going to create a new dataframe that has the sum total bike lane length by each geoid

tot_bikelane_length_byGEOID <- 
    aggregate(data = bikelane_intersection,roadlenght ~ GEOID, sum) %>%
    rename(combined_roadlength = roadlenght)

okay cool it seems like we got we are looking for. We have 127 objects which matches the number of objects in sac_centroid_select and the sum total lenght of all the bikelanes matches that from bikelane_intersection

now we can join our combined bikelane road length data to our census tract data.

sac_centroid_select <- tot_bikelane_length_byGEOID %>%
    left_join(sac_centroid_select) %>%
    st_sf()
## Joining with `by = join_by(GEOID)`

now let’s calculate the area of each tract and caculate the desity of bikelanes per each tract by dividing the combined bikelane lenght by the area of each tract.

sac_centroid_select <- sac_centroid_select %>%
 mutate(tract_area = st_area(.))
 
ggplot() +
     geom_sf(data = sac_centroid_select, aes(fill = as.numeric(tract_area)))

sac_centroid_select <- sac_centroid_select %>%
    mutate(bikelane_density = combined_roadlength/tract_area)
 
ggplot() +
     geom_sf(data = sac_centroid_select, aes(fill = as.numeric(bikelane_density))) + 
     geom_sf(data = bikelane_intersection, color = "red", alpha = .5) 

That looks about right to me. I want to see how the spread of the bikelane density looks

hist(sac_centroid_select$bikelane_density, breaks = 10)

Looking at the spread of our data, it seems that there are handful of tracts with a very high bikelane density whereas most of the city hovers on the lower end. I wonder what the top 20 percent of tracts by bikelane density are.

print(
  quantile(sac_centroid_select$bikelane_density, probs = .8)  
     )
## 0.001076755 [1/US_survey_foot]
ggplot() +
     geom_sf(data = filter(sac_centroid_select,
                           bikelane_density >= quantile(sac_centroid_select$bikelane_density, probs = .8)
                          ),
     aes(fill = as.numeric(bikelane_density))
            ) + 
     geom_sf(data = bikelane_intersection, color = "red", alpha = .5) 

I wonder what the top 10 percent looks like

ggplot() +
     geom_sf(data = filter(sac_centroid_select,
                           bikelane_density >= quantile(sac_centroid_select$bikelane_density, probs = .9)
                          ),
     aes(fill = as.numeric(bikelane_density))
            ) + 
     geom_sf(data = bikelane_intersection, color = "red", alpha = .5) 

what about the bottom 20%

ggplot() +
     geom_sf(data = filter(sac_centroid_select,
                           bikelane_density <= quantile(sac_centroid_select$bikelane_density, probs = .2)
                          ),
     aes(fill = as.numeric(bikelane_density))
            ) + 
     geom_sf(data = bikelane_intersection, color = "red", alpha = .5)